Description

This R notebook is a bioinformatics pipeline to analyze data obtained by sorting a barcoded transposon library in Ralstonia eutropha (a.k.a. Cupriavidus necator) by density, achieving a selection by PHB content. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018). This notebook has been adapted from a pipeline for analysis of chemostat competition experiments by Michael Jahn.

The dataset is a NextSeq 500 run, from two experiments with PHB sorting by centrifuge with either abundant or no nitrogen (NH4Cl). A problem (see below) was that the diversity in the libraries was much worse than we had seen before, especially in the N-starved sample. For this reason, the N-starvation experiment was redone (see another notebook) and only the N-rich condition was used.

Load data

Libraries

Overview of barcode/transposon read counts

Data import and processing

Read in the main data tables with A) reads per barcode and sample (‘pool counts’), B) the fitness tables, C) the summary statistics, and D) the genome annotation. Tables were obtained by processing sequencing data with a custom BarSeq pipeline.

Get gene annotation and esssentiality

Summary statistics

Overview about the number of reads per sample and their mapping. The proportion of mapped reads is lower than normal (~75 %) in all samples. While N+ samples and N- no 3 have ca 35 % mapped reads (already low), the remaining N- are around 9%. Notably N+ and N- no 3 was done in a different experiment compared to the remaining N- samples, with a different pre-cultures inoculated at different days. This suggest the low mapping was caused by the experiment, rather than during sequencing.

Manual inspection of the reads show that the vast majority of the reads have normal read structure, i. e. the presequence, a 20-nt barcode and the post-sequence. The problem is that the barcode could not be mapped to a known barcode from the TnSeq. Let’s inspect the actual barcodes we got


Barcode diversity

To investigate the barcode distribution, we compare samples from the low- and high-mapping samples to a sample from the previous sucrose-formate MC experiment. Here we used unsorted F0 samples, that should not have been enriched for PHB-related mutants.

The total number of barcodes observed is similar between samples (ca 10M), but the number off mappable barcodes is much less in the new samples. In the new samples, there are a few barcodes with extremely high read counts, so that the 100 most abundant ones make upp more than 75% of the total reads. Conversely, there are few barcodes with what would normally be “high” counts, i. e. over 100 000.

Based on this, there could be said to be two separate problems.

  • Extreme enrichment of a few barcodes, leading to read compression
  • General enrichment of un-mappable barcodes

Michael has inspected some of the un-mapped barcodes manually. He found both those that were not seen by the TnSeq data at all, and those that were seen but mapped to several locations (likely due to short read length). This means that the un-mapped codes are likely real strains in the Tn library, but that a small group of mutants have been overly enriched.

Which are these barcodes? Is it the same over-enriched and unmappable barcodes between the samples, and between the experiments?

Here we plot the the common genes among the 200 most seen BarSeq mutants in two experiments. We see that the in the two N-starved replicates, the reads are almost identical. In the N-rich samples, the read numbers are highly correlated but about half the mutants are about 30 % more prevalent in one of the samples. The reason for this is unknown.

Between the N-starved and N-rich samples, we also see that it is largely the same mutants that are enriched. Between the new samples and the old sample however, there were only few mutants that appeared together.

The conclusion is that the enrichment of these mutants depends on some common factor between the two different experiments performed before centrifuge sorting, but that these factors were not present in the MC experiment. The cause of these problems is likely somewhere in the experiment, not in the seq library preparation or in the sequencing. Two theories are

  • Repeated freeze-thaw cycles of the frozen library has biased the survivors.
  • The preculture phases have selected has biased the library

Distribution of number of reads per barcode. Many samples have less than few barcodes with more than 2^5 = 32 barcodes, will be hard to quantify these. Blue is the mean in each sample.

Similarly to the above, this is an overview about the number of barcodes per gene as a histogram. This distribution is the same for all conditions and replicates. The second plot is the number of reads per gene, averaged as median over all conditions (excluding 0 time point where counts were averaged by BarSeq pipeline).

The average of detected barcodes per gene is 3.8, and the median is 3. The average reads per gene is 384, and the median is 15.5.

There are somewhat more reads per gene if we include only the N-rich samples. The average of detected barcodes per gene is 3.7, and the median is 3. The average reads per gene is 499, and the median is 18.

Gene fitness analysis

We can plot log2 FC or normalized gene fitness over generations. For this type of overview it is best to summarize individual replicates (3x) to the mean or median, per time point and condition. We also add genome annotation to the summary table.

As sort of an internal control, we compare the gene fitness obtained by a complex procedure to the log2 FC of read counts, which is a very simple measure of ‘fitness’. The two variables correlate well, although there is often an offset between replicates.

Most enriched dense mutants in N-rich conditions

Cells producing PHB become denser, meaning they fall through the density gradient to the middle or bottom fractions (F2 and F3). If look at which strains that are enriched here, i. e. cells producing PHB even when not N-starved, there were several significant hits. Interestingly, the two strongest hits in both conditions are the same and related: leuC and leuB. Otherwise, most of the hits were just present in one medium, and there are more significant scores (>2.5) in the formate condition.

Here are the 60 most F3-enriched genes:

If we only look at non-essential genes:

The lists of fructose- and formate-essential genes are identical in this case. The 8-generation score from the M experiments is plotted in each subpanel.

Most depleted light cells

When N is abundant, wt cell produce little PHB and normaly band in the top fraction after centrifugation. What should be equivalent is to select mutants depleted in the F1 fraction. The leuBC genes are seen here as well.

Here are the 60 most F1-depleted genes:

If we only look at non-essential genes: The fructose- and formate-essential genes are identical also here. The 8-generation score from the MC is plotted in each subpanel.

Protein interactions

To learn more about functional relationships between enriched genes/mutants, we can submit the gene list to the STRING interaction database and retrieve a network of probable interactions. Copied from Micheals BarSeq-pulse.

Get the most enriched genes and submit to STRING.

Leucine biosynthesis

There is enrichment of some leucine synthesis genes in N-rich sinking cells. If we display all leucine synthesis genes, we see that only leuB, leuC and leuD are affected. They are depleted in floating cells and enriched in sinking cells. There is a such trend also in N-starvation conditions. None of these genes are obviously related to PHB synthesis.

## # A tibble: 15 x 4
##    locus_tag gene_name         Process           Pathway                        
##    <chr>     <chr>             <chr>             <chr>                          
##  1 H16_A0477 leuB1 H16_A0477   Carbohydrate met
 Glyoxylate and dicarboxylate m

##  2 H16_A0561 ilvE H16_A0561    Amino acid metab
 Cysteine and methionine metabo

##  3 H16_A1041 leuA H16_A1041    Amino acid metab
 Valine, leucine and isoleucine

##  4 H16_A1236 leuC leuC1 H16_A
 Amino acid metab
 Valine, leucine and isoleucine

##  5 H16_A1237 leuD leuD1 H16_A
 Amino acid metab
 Valine, leucine and isoleucine

##  6 H16_A1549 leuC2 H16_A1549   Amino acid metab
 Valine, leucine and isoleucine

##  7 H16_A2133 leuB2 H16_A2133   Carbohydrate met
 Glyoxylate and dicarboxylate m

##  8 H16_A2619 leuB3 leuB H16_A
 Amino acid metab
 Valine, leucine and isoleucine

##  9 H16_A2620 leuD leuD3 H16_A
 Amino acid metab
 Valine, leucine and isoleucine

## 10 H16_A2621 leuC3 leuC H16_A
 Amino acid metab
 Valine, leucine and isoleucine

## 11 H16_A2999 cysI1 H16_A2999   Energy metabolism Sulfur metabolism              
## 12 H16_B0052 leuC4 H16_B0052   Amino acid metab
 Valine, leucine and isoleucine

## 13 H16_B0081 leuA2 leuA H16_B
 Amino acid metab
 Valine, leucine and isoleucine

## 14 H16_B1581 metE H16_B1581    Amino acid metab
 Cysteine and methionine metabo

## 15 H16_B2275 leuC5 leuC H16_B
 Amino acid metab
 Valine, leucine and isoleucine


The leuBCD genes share a common operon, with only these three genes. To check whether it could be a wider genomic effect, we can look at the surrounding genes. There is no data for flanking asd gene, but neither of the neighbors on either side fimV or livF show any enrichment or depletion on the gradient.

Hydrogenases

There are several hydrogenases that have a pattern where the in the fructose sample it is depleted in the floating fraction and in the formate sample is enriched in the sinking fraction. These are, to varying degrees, hoxA, hypB, hypE, hypF and the unnamed PHG390.

## # A tibble: 15 x 4
##    locus_tag gene_name      Process                       Pathway               
##    <chr>     <chr>          <chr>                         <chr>                 
##  1 PHG001    hoxK PHG001    Xenobiotics biodegradation a
 Nitrotoluene degradat

##  2 PHG002    hoxG PHG002    Xenobiotics biodegradation a
 Nitrotoluene degradat

##  3 PHG008    hoxQ PHG008    Poorly characterized          General function pred

##  4 PHG012    hypA PHG012    Genetic information processi
 Chaperones and foldin

##  5 PHG013    hypB PHG013    Genetic information processi
 Chaperones and foldin

##  6 PHG014    hypF1 hypF PH
 Unclassified: genetic inform
 Protein processing    
##  7 PHG015    hypC PHG015    Unclassified: genetic inform
 Protein processing    
##  8 PHG016    hypD PHG016    Unclassified: genetic inform
 Protein processing    
##  9 PHG017    hypE PHG017    Unclassified: genetic inform
 Protein processing    
## 10 PHG018    hypX PHG018    Signaling and cellular proce
 Two-component system  
## 11 PHG019    hoxA PHG019    Signal transduction           Two-component system  
## 12 PHG020    hoxB PHG020    Hypothetical                  Hypothetical          
## 13 PHG021    hoxC PHG021    Xenobiotics biodegradation a
 Nitrotoluene degradat

## 14 PHG089    hoxU PHG089    Unclassified: metabolism      Enzymes with EC numbe

## 15 PHG390    PHG390         Hypothetical                  Hypothetical

A collection of other, mosty little characterized genes

## # A tibble: 5 x 4
##   locus_tag gene_name Process          Pathway                
##   <chr>     <chr>     <chr>            <chr>                  
## 1 H16_A3183 H16_A3183 Hypothetical     Hypothetical           
## 2 H16_A3741 H16_A3741 Hypothetical     Hypothetical           
## 3 H16_B2043 H16_B2043 Hypothetical     Hypothetical           
## 4 H16_B2171 H16_B2171 Lipid metabolism Sphingolipid metabolism
## 5 PHG390    PHG390    Hypothetical     Hypothetical